Initial commit - FLUJOS codebase (production branch)

Includes: FLUJOS app (Node/Flask/Python), FLUJOS_DATOS scripts (scrapers, Keras, Django) Excludes: MongoDB, scraped data, Wikipedia/WikiLeaks dumps, Python venv, node_modules
2026-03-31 14:10:02 +02:00 · a40b946163
commit a40b946163
158 changed files with 196645 additions and 0 deletions
--- a/FLUJOS_DATOS/NOTICIAS/DOCS/arquitectura_main.txt
+++ b/FLUJOS_DATOS/NOTICIAS/DOCS/arquitectura_main.txt
@@ -0,0 +1,497 @@
+                   ┌───────────────────────────────────────────────────┐
+                   │                      main()                       │
+                   └──────────────┬────────────────────────────────────┘
+                                  │
+                     List of media and leak URLs
+                                  │
+                                  ▼
+        ┌───────────────────────────────────────────────────────────────┐
+        │ register_processed_notifications(base_folder, urls)          │
+        │  - Creates/reads processed_articles.txt                      │
+        │  - Filters duplicates                                        │
+        └──────────────┬───────────────────────────────────────────────┘
+                       │ urls_to_process
+     ┌─────────────────┴───────────────────────────┐
+     │                                             │
+     ▼                                             ▼
+┌──────────────────────────────┐          ┌──────────────────────────────┐
+│ explore_and_extract_articles │          │   explore_wayback_machine    │
+│ (crawling + extraction)      │          │ (queries Wayback, extracts)  │
+└────────────┬─────────────────┘          └────────────┬─────────────────┘
+             │                                          │
+             │ saves                                    │ saves
+             ▼                                          ▼
+   ┌────────────────────┐                      ┌────────────────────┐
+   │  articulos/ (txt)  │                      │  articulos/ (txt)  │
+   └────────────────────┘                      └────────────────────┘
+             ▲                                          ▲
+             │                                          │
+             ▼                                          ▼
+   ┌────────────────────┐                      ┌────────────────────┐
+   │  archivos/ (bin)   │  ← download         │  (n/a)             │
+   └────────────────────┘                      └────────────────────┘
+
+            ┌────────────────────────────────────────────────┐
+            │ process_files(archivos/ → tokenized/)          │
+            │ tokenize_all_articles(articulos/ → tokenized/) │
+            └──────────────┬─────────────────────────────────┘
+                           │
+                           ▼
+                  ┌────────────────────┐
+                  │  tokenized/ (ids)  │  ← BERT ids (max 512)
+                  └────────────────────┘
+
+                           ▼
+                ┌──────────────────────────┐
+                │  get_folder_info + logs  │
+                └──────────────────────────┘
+
+Input: root url, folders, processed_urls, size_limit, depth=0..6
+                                │
+                                ▼
+                    ┌──────────────────────────┐
+                    │ HTMLSession().get(url)   │
+                    │ response.html.render()   │  ← render JS (headless)
+                    └─────────────┬────────────┘
+                                  │
+                                  ▼
+                       Set of absolute_links
+                                  │
+                 ┌────────────────┴─────────────────┐
+                 │                                  │
+     If known extension                    If generic HTML page
+    (.pdf .csv .txt .xlsx .docx            (no extension/anything else)
+     .html .md .zip)                        │
+                 │                          ▼
+                 ▼                ┌──────────────────────────────┐
+ ┌──────────────────────────┐     │  extract_and_save_article()  │
+ │ download_and_save_file() │     │  - parse <title>, <p>        │
+ │  → archivos/             │     │  - translate → clean → txt   │
+ └───────────┬──────────────┘     │  - save to articulos/        │
+             │                    └──────────────┬───────────────┘
+             │ (after each action)              │
+             ▼                                   ▼ (recursive)
+  get_folder_info(articulos/, archivos/)  explore_and_extract_articles(link, depth+1)
+             │
+             ▼
+   total_size >= 50GB? ──► Yes: stop │ No: continue
+
+
+archivos/  ──────────────────────────────────────────────────────┐
+                                                                  ▼
+                 For each file, by extension:
+        ┌─────────────────────────────────────────────────────────────┐
+        │ .pdf  → read_pdf()      → text                              │
+        │ .csv  → read_csv()      → text                              │
+        │ .txt  → open().read()   → text                              │
+        │ .docx → read_docx()     → text                              │
+        │ .xlsx → read_xlsx()     → text                              │
+        │ .zip  → read_zip()      → concatenated text                 │
+        │ .html/.md → read_html_md() → format_content()(*) → text     │
+        └─────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+                     translate_text(deep_translator → 'es')
+                                   │
+                                   ▼
+            clean_text( BeautifulSoup strip + lowercase
+                        + strip URLs + letters/spaces only
+                        + collapse whitespace + STOPWORDS (ES) )
+                                   │
+                                   ▼
+         tokenize_and_save(cleaned_text, filename, tokenized/)
+                                   │
+                                   ▼
+                    tokenized/ holds BERT IDs (max 512)
+(*) Note: `format_content()` is empty in the current code; for now it acts as a no-op.
+
+
+
+
+
+
+
+articulos/ (clean txt, es) ─────► for each .txt:
+                                     │
+                                     ▼
+           tokenizer.encode(text, truncation=True, max_length=512, add_special_tokens=True)
+                                     │
+                                     ▼
+                    'id id id ...' → written to tokenized/ under
+                       the same filename
+
+
+
+clean_filename(name)
+  - replaces \ / * ? : " < > | with "_"
+  - spaces → '_'
+  - truncates to 100 chars
+
+register_processed_notifications(base_folder, urls)
+  - reads base_folder/processed_articles.txt (if it exists)
+  - returns only the URLs not seen before
+  - appends the new ones to the end of the file
+
+explore_wayback_machine(url)
+  - GET http://archive.org/wayback/available?url=...
+  - if there is a 'closest' snapshot → extract_and_save_article(archive_url)
+
+
+
+translate_text(text) ──► GoogleTranslator(auto→'es') ──► text_in_es
+                                   │
+                                   ▼
+clean_text():
+  1) strips CDATA variants:  <! [ CDATA [ ... ] ] >
+  2) BeautifulSoup → .get_text()
+  3) lower()
+  4) removes URLs (regex http\S+)
+  5) keeps only Spanish letters and whitespace (regex)
+  6) collapses whitespace
+  7) filters STOPWORDS (extensive ES list)
+
+
+/var/www/theflows.net/flujos/FLUJOS_DATOS/NOTICIAS/
+├── articulos/      (clean txt in ES, from HTML/Wayback pages)
+├── archivos/       (raw downloads: pdf, csv, xlsx, docx, zip, html, md, txt)
+├── tokenized/      (same names, content = space-separated BERT IDs)
+└── processed_articles.txt (history of processed URLs)
+
+
+logging.basicConfig(level=INFO)
+- Traces each stage (download, extraction, tokenization)
+- Final summary:
+    * number of files in articulos/, archivos/, tokenized/
+    * total sizes (MB)
+- Size guard: 50 GB across articulos/ + archivos/
+
+
+
+[ main() ]
+   │
+   ▼
+register_processed_notifications()
+   │  → filters out already-processed URLs
+   ▼
+---------------------------+
+| urls_to_process (new)    |
+---------------------------+
+   │
+   ▼
+explore_and_extract_articles()
+   │
+   ├─► If link to a file (.pdf, .csv, .txt, .xlsx, .docx, .html, .md, .zip)
+   │       └─► download_and_save_file() → saves to /archivos
+   │
+   ├─► If HTML link → extract_and_save_article()
+   │       ├─ translates (translate_text)
+   │       ├─ cleans (clean_text)
+   │       └─ saves .txt to /articulos
+   │
+   └─► Recursive up to max_depth or the 50 GB limit
+         │
+         ▼
+explore_wayback_machine() → fetches the archived version if one exists
+   │
+   ▼
+process_files(/archivos → /tokenized)
+   │  ├─ read_pdf/csv/docx/xlsx/zip/html_md/txt
+   │  ├─ translate_text()
+   │  ├─ clean_text()
+   │  └─ tokenize_and_save() with BERT
+   │
+   ▼
+tokenize_all_articles(/articulos → /tokenized)
+   │  └─ encodes with BERT into space-separated IDs
+   │
+   ▼
+get_folder_info() → logs the final summary
+
+
+===========================================================
+Simplified flow for processing a single file
+===========================================================
+
+downloaded file
+   │
+   ▼
+read_*() by extension
+   │
+   ▼
+translate_text()  → GoogleTranslator(auto→es)
+   │
+   ▼
+clean_text()
+   ├─ strips CDATA and HTML
+   ├─ lowercase, no URLs, Spanish letters only
+   ├─ collapses whitespace
+   └─ filters ES stopwords
+   │
+   ▼
+tokenize_and_save()
+   ├─ tokenizer.encode(truncation=True, max_length=512)
+   └─ saves BERT IDs to /tokenized
+
+
+===========================================================
+Folder structure
+===========================================================
+
+NOTICIAS/
+├── articulos/       ← clean .txt in Spanish
+├── archivos/        ← raw downloaded binaries
+├── tokenized/       ← BERT tokens (IDs)
+└── processed_articles.txt ← URL history
+
+
+===========================================================
+Control and limits
+===========================================================
+
+- Maximum crawl depth: max_depth = 6
+- Combined size of articles + files: 50 GB limit
+- Avoids duplicates via processed_articles.txt
+- Detailed logs on the console
+
+
+
+████████████████████████████████████  FLUJOS: TECHNICAL OUTLINE (ASCII)  ████████████████████████████████████
+
+[ ENVIRONMENT / DEPENDENCIES ]
+- GoogleTranslator (deep_translator)         → Google Translate web API (auto→es)
+- requests / requests_html.HTMLSession       → HTTP + JS rendering (headless Chromium)
+- BeautifulSoup (bs4)                        → HTML parsing, text extraction
+- PyPDF2.PdfReader                           → Text extraction from PDFs (only if the text is embedded/selectable)
+- openpyxl                                   → Reading .xlsx
+- python-docx (docx)                         → Reading .docx
+- zipfile                                    → Opening ZIPs and reading their entries as text
+- html2text (imported, not used in format_content) → [placeholder]
+- transformers.BertTokenizer                 → Spanish BERT tokenizer (dccuchile/bert-base-spanish-wwm-cased)
+- tqdm, logging, os, re, json, time, csv, hashlib, urllib.parse (urlparse/urljoin)  → utilities
+
+[ FOLDER STRUCTURE (I/O) ]
+/var/www/theflows.net/flujos/FLUJOS_DATOS/NOTICIAS/
+├── articulos/        (clean TXT in Spanish, derived from HTML/Wayback)
+├── archivos/         (raw downloads: .pdf .csv .txt .xlsx .docx .html .md .zip)
+├── tokenized/        (TXT with BERT token IDs, max 512 tokens per file)
+└── processed_articles.txt   (history of already-processed URLs → avoids duplicates)
+
+=============================================================================================================
+[ main() ]
+- Initializes: target URLs, base paths, 50GB limit, folders if they do not exist
+- Flow:
+    1) urls_to_process = register_processed_notifications(base_folder, urls)
+    2) For url in urls_to_process:
+         a) explore_and_extract_articles(url, articulos/, archivos/, processed_urls, size_limit)
+         b) explore_wayback_machine(url, articulos/)
+    3) process_files(archivos/ → tokenized/)
+    4) tokenize_all_articles(articulos/ → tokenized/)
+    5) get_folder_info() on each folder, then a logged summary
+- Side effects: writes to articulos/, archivos/, tokenized/, processed_articles.txt; INFO logs
+
+=============================================================================================================
+[ register_processed_notifications(base_folder, urls) ]
+- Reads/creates processed_articles.txt
+- Returns: list of URLs not yet present (the new ones)
+- Effects:
+    * appends the new URLs to the file
+- Risks:
+    * The file grows over time (could be moved to a DB or a persistent set)
+    * No locking (race conditions if several processes run in parallel)
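The read, filter, append cycle described for register_processed_notifications can be sketched as below. This is only an illustration under assumptions: the helper name `filter_new_urls` and its `log_name` parameter are hypothetical, and the real function in the script may differ in detail.

```python
import os
import tempfile


def filter_new_urls(base_folder, urls, log_name="processed_articles.txt"):
    """Return the URLs not yet in the log file and append them to it.

    Hypothetical stand-in for register_processed_notifications().
    """
    log_path = os.path.join(base_folder, log_name)
    seen = set()
    if os.path.exists(log_path):
        with open(log_path, "r", encoding="utf-8") as f:
            seen = {line.strip() for line in f if line.strip()}

    new_urls = []
    for url in urls:
        if url not in seen:
            seen.add(url)  # also drops repeats inside the same batch
            new_urls.append(url)

    # Append mode keeps the history from earlier runs intact.
    with open(log_path, "a", encoding="utf-8") as f:
        for url in new_urls:
            f.write(url + "\n")
    return new_urls


with tempfile.TemporaryDirectory() as folder:
    first = filter_new_urls(folder, ["https://a.example", "https://b.example"])
    second = filter_new_urls(folder, ["https://b.example", "https://c.example"])
    print(first, second)
```

As the risks above note, nothing here locks the file, so two processes calling it at the same time could both treat a URL as new.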
+
+=============================================================================================================
+[ explore_and_extract_articles(url, articulos/, archivos/, processed_urls, size_limit, depth=0..6) ]
+- GET + JS render:
+    session = HTMLSession(); response = session.get(url); response.html.render()
+- Collects absolute_links (after JS has run)
+- For each link:
+    * If already in processed_urls → skip
+    * Marks the link as processed (in-memory set)
+    * Detects extension: [.pdf .csv .txt .xlsx .docx .html .md .zip]
+        - Match → download_and_save_file(link, archivos/)
+        - mailto:/tel: → ignored
+        - Other/HTML → extract_and_save_article(link, articulos/); recurse with depth+1
+- Size control:
+    * get_folder_info(articulos/) + get_folder_info(archivos/) → if ≥ 50GB → stop
+- Notes:
+    * render() needs Chromium installed and is expensive in CPU/RAM
+    * Watch out for SPA/anti-bot sites; timeouts (30s)
+    * max_depth=6 limits link explosion; a domain filter could be added
+
+=============================================================================================================
+[ download_and_save_file(url, archivos/) ]
+- Streaming download (8192-byte chunks) with requests.get(url, stream=True, timeout=30)
+- Filename = clean_filename(last URL segment) || 'archivo_descargado'
+- Writes binary to archivos/
+- Errors:
+    * response.status_code != 200 → log
+    * timeouts/connection errors → log
+- Security:
+    * Executes nothing; only saves
+    * Risk: HTML/JS saved as .html/.md may contain scripts (but it is only processed as text afterwards)
+
+=============================================================================================================
+[ extract_and_save_article(url, articulos/) ]
+- Plain GET (requests.get, timeout=30)
+- Parses HTML: <title> and every <p> → concatenates the text
+- Processing:
+    * translate_text() (auto→es)
+    * clean_text()
+- Filename:
+    * title → clean_filename(title) + '.txt'
+    * fallback: last segment of the URL path + '.txt'
+- Saves the .txt in articulos/
+- Risks:
+    * Pages whose content lives in divs/aria/role elements rather than <p> → less text captured
+    * Translator limitations (quotas, length limits, transient errors)
+    * If content is empty → log and skip
+
+=============================================================================================================
+[ explore_wayback_machine(url, articulos/) ]
+- Queries the API: http://archive.org/wayback/available?url={url}
+- If there is a 'closest' snapshot → archive_url → extract_and_save_article(archive_url)
+- Uses:
+    * Resilience against 404s/robots or rotating content
+- Risks:
+    * Not every page is archived
+    * Rate limits
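To make the 'closest' lookup concrete, here is a small sketch that builds the availability query and reads the snapshot URL out of an already-decoded JSON response. The helper names are hypothetical, the sample payload only mirrors the shape the availability endpoint is known to return, and no network request is made.

```python
from urllib.parse import urlencode

WAYBACK_API = "http://archive.org/wayback/available"


def build_availability_url(url):
    """Build the query string; urlencode escapes the target URL safely."""
    return f"{WAYBACK_API}?{urlencode({'url': url})}"


def closest_snapshot_url(payload):
    """Return the archived URL from a decoded JSON response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Illustrative payload (assumed shape, not a captured response).
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20200101000000/http://example.com/",
            "timestamp": "20200101000000",
            "status": "200",
        }
    }
}
print(build_availability_url("http://example.com/a?b=1"))
print(closest_snapshot_url(sample))
print(closest_snapshot_url({"archived_snapshots": {}}))
```

Escaping the target URL matters when it carries its own query string; otherwise its parameters would be read as parameters of the API call.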
+
+=============================================================================================================
+[ process_files(archivos/, tokenized/) ]
+- Iterates over the downloaded files by extension:
+    .pdf  → read_pdf()      (PdfReader.extract_text per page)
+    .csv  → read_csv()      (csv.reader → " ".join(row))
+    .txt  → open().read()   (text as-is)
+    .docx → read_docx()     (docx.Document → concatenated paragraphs)
+    .xlsx → read_xlsx()     (openpyxl → cells joined per row)
+    .zip  → read_zip()      (opens each entry, decodes utf-8 with errors='ignore')
+    .html/.md → read_html_md() → format_content()  [*format_content is empty → no-op]
+- For each content (if it has text):
+    translate_text() → clean_text() → tokenize_and_save(cleaned, filename, tokenized/)
+- Notes:
+    * Binary entries inside a ZIP are not skipped: errors='ignore' only drops the undecodable bytes, so the leftover junk text is still concatenated
+    * PDF without a text layer → extract_text() may return an empty string or None
+    * Large XLSX → memory/time cost; iter_rows() is a reasonable choice
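The ZIP note can be checked with a small experiment. The sketch below shows what `errors='ignore'` does to binary bytes and one possible way to skip binary entries instead, by decoding strictly. The function `read_zip_text_only` is a hypothetical alternative, not the script's read_zip().

```python
import io
import zipfile

# errors='ignore' drops only the invalid bytes; the rest survives as text.
print(repr(b"\xff\xfeabc".decode("utf-8", errors="ignore")))


def read_zip_text_only(zip_source):
    """Concatenate only the ZIP entries that decode cleanly as UTF-8."""
    parts = []
    with zipfile.ZipFile(zip_source, "r") as z:
        for info in z.infolist():
            if info.is_dir():
                continue
            data = z.read(info)
            try:
                parts.append(data.decode("utf-8"))  # strict decoding
            except UnicodeDecodeError:
                continue  # treated as binary and skipped
    return "\n".join(parts)


# Build a tiny archive in memory: one text entry and one binary entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("nota.txt", "hola mundo")
    z.writestr("imagen.bin", b"\xff\xd8\xff\xe0binary")
buffer.seek(0)
print(read_zip_text_only(buffer))
```

Strict decoding is a heuristic: a binary file that happens to be valid UTF-8 would still pass, so filtering by entry extension may be a useful complement.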
+
+=============================================================================================================
+[ tokenize_all_articles(articulos/, tokenized/) ]
+- For each .txt in articulos/:
+    tokenizer.encode(text, truncation=True, max_length=512, add_special_tokens=True)
+    → space-separated ids → saved under the same filename in tokenized/
+- Notes:
+    * Truncation at 512 tokens: long content is lost (consider sliding windows)
+    * add_special_tokens=True adds [CLS]/[SEP]
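The sliding-window idea mentioned in the notes could look roughly like this. It is a sketch over a plain list of token ids, so it runs without the tokenizer. The function name and the `stride` value are assumptions, and in practice each window would still need its own [CLS]/[SEP] ids, which means leaving room for them (for example 510 content ids per window).

```python
def sliding_windows(token_ids, max_length=512, stride=256):
    """Split a list of token ids into overlapping windows of at most max_length.

    Hypothetical helper: consecutive windows start `stride` ids apart,
    so they overlap by max_length - stride ids.
    """
    if not 0 < stride <= max_length:
        raise ValueError("need 0 < stride <= max_length")
    if not token_ids:
        return []
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window already reaches the end
        start += stride
    return windows


ids = list(range(10))
for window in sliding_windows(ids, max_length=4, stride=2):
    print(window)
```

Each window would then be written to its own output file, or to one line per window, instead of discarding everything past the first 512 ids.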
+
+=============================================================================================================
+[ tokenize_and_save(text, filename, tokenized/) ]
+- Wraps the call to tokenizer.encode(...) with truncation=True, max_length=512
+- Creates tokenized/ if it does not exist
+- Writes "id id id ..." to the output file
+- Risks:
+    * Differences in input encoding → normalized by clean_text()
+    * If filename collides with another (same name) → it is overwritten
+
+=============================================================================================================
+[ translate_text(text) ]
+- GoogleTranslator(source='auto', target='es').translate(text)
+- Returns the translated text, or the original on error (catch + log)
+- Limitations:
+    * Excessive length → errors (“Text length need to be between 0 and 5000”)
+      - Future fix: split into blocks and reassemble
+    * Rate limits/API changes
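The "split into blocks and reassemble" fix could be sketched as follows. Both helper names are hypothetical, the 4500-character default is an arbitrary safety margin under the 5000 limit quoted above, and the `translate` argument stands in for any callable such as translate_text(). Note that splitting on whitespace collapses line breaks into single spaces.

```python
def split_into_chunks(text, max_chars=4500):
    """Split text into pieces of at most max_chars, breaking on whitespace."""
    chunks, current, length = [], [], 0
    for word in text.split():
        # A single over-long word is hard-cut so no chunk exceeds the limit.
        while len(word) > max_chars:
            if current:
                chunks.append(" ".join(current))
                current, length = [], 0
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def translate_long_text(text, translate, max_chars=4500):
    """Translate chunk by chunk with the given callable and rejoin."""
    return " ".join(translate(c) for c in split_into_chunks(text, max_chars))


# str.upper stands in for the real translator in this demonstration.
print(split_into_chunks("aaa bbb ccc ddd", max_chars=7))
print(translate_long_text("aaa bbb ccc ddd", str.upper, max_chars=7))
```

Breaking at sentence boundaries instead of arbitrary words would likely give the translator better context, at the cost of a slightly more involved splitter.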
+
+=============================================================================================================
+[ clean_text(text) ]
+1) Strips CDATA: regex '<!\[\s*CDATA\s*\[.*?\]\]>' with flags=re.S
+2) BeautifulSoup(text, 'html.parser').get_text(separator=" ")
+3) lower()
+4) Removes URLs: regex r'http\S+'
+5) Keeps only Spanish letters and whitespace: r'[^a-záéíóúñü\s]' → ''
+6) Collapses whitespace: r'\s+' → ' ' + strip()
+7) Filters STOPWORDS (ES set) word by word
+- Produces normalized text ready for BERT
+- Notes:
+    * Drops digits, punctuation, and accented characters outside the set
+    * STOPWORDS can be tuned per domain (news vs. technical)
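A worked example may help show what steps 3 to 7 do to a sentence. The sketch below leaves out the CDATA and BeautifulSoup steps so it runs with the standard library alone, and it uses a tiny illustrative stopword set instead of the full STOPWORDS list. The name `clean_text_sketch` is hypothetical.

```python
import re

# Tiny illustrative stopword set; the real STOPWORDS list is much longer.
SAMPLE_STOPWORDS = {"la", "de", "ya", "que"}


def clean_text_sketch(text, stopwords=SAMPLE_STOPWORDS):
    """Steps 3-7 only; the CDATA and BeautifulSoup steps are omitted."""
    text = text.lower()                           # 3) lowercase
    text = re.sub(r"http\S+", "", text)           # 4) drop URLs
    text = re.sub(r"[^a-záéíóúñü\s]", "", text)   # 5) letters/whitespace only
    text = re.sub(r"\s+", " ", text).strip()      # 6) collapse whitespace
    # 7) filter stopwords word by word
    return " ".join(w for w in text.split() if w not in stopwords)


print(clean_text_sketch("La OTAN aprobó 3 medidas: https://example.com/x ¡Ya!"))
print(clean_text_sketch("Français 2024"))
```

The second call illustrates the note about characters outside the set: the "ç" is deleted rather than mapped to "c", so the word comes out misspelled.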
+
+=============================================================================================================
+[ read_pdf(pdf_path) ]
+- Opens in binary mode, PdfReader(f)
+- Iterates over pages → page.extract_text() → concat + '\n'
+- Returns a string (may be empty)
+- Limitations:
+    * Scanned PDFs → no OCR (no text)
+    * Complex layouts → scrambled text order
+
+[ read_csv(csv_path) ]
+- csv.reader → for each row ' '.join(row) + '\n'
+- Simple and robust; no handling of types/special formats
+
+[ read_docx(docx_path) ]
+- docx.Document → concat paragraph.text + '\n'
+- Loses styles/tables; keeps only the body paragraph text
+
+[ read_xlsx(xlsx_path) ]
+- openpyxl.load_workbook → for each sheet → for each row
+- ' '.join(str(cell.value) if cell.value is not None else '' for cell in row) + '\n'
+- Loses formatting/types; only values in row order
+
+[ read_zip(zip_path) ]
+- zipfile.ZipFile → iterates over each entry
+- z.open(filename).read().decode('utf-8', errors='ignore')
+- Concatenates everything into a single string
+- Hazards: huge ZIP → memory; binary entries are not skipped, errors='ignore' only drops their undecodable bytes
+
+[ read_html_md(file_path) ]
+- open(file_path, 'r', encoding='utf-8', errors='replace').read()
+- Returns the raw string (no HTML cleanup here)
+- format_content() is called afterwards (currently empty)
+
+[ format_content(html_content) ]
+- [PLACEHOLDER] Not implemented in the current code.
+- Potential:
+    * html2text → plain Markdown
+    * Removing scripts/styles/menus
+    * Normalizing whitespace/entities
+- Currently a NO-OP (should be filled in)
+
+=============================================================================================================
+[ get_page_title(url) ]
+- GET(url, timeout=10) → BeautifulSoup → <title>.text.strip()
+- Returns None on failure or if there is no <title>
+- Used to name article files
+
+[ clean_filename(name) ]
+- Replaces forbidden characters [\/*?:"<>|] with "_"
+- Spaces → "_"; truncates to 100 chars
+- Avoids filesystem errors; normalizes names
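The three rules listed for clean_filename translate almost directly into code. The version below is a sketch written from that description, so the actual implementation in the script may differ; the `max_length` parameter is an assumption added for illustration.

```python
import re


def clean_filename(name, max_length=100):
    """Sketch based on the description above, not the script's own code."""
    name = re.sub(r'[\\/*?:"<>|]', "_", name)  # forbidden characters → "_"
    name = name.replace(" ", "_")               # spaces → "_"
    return name[:max_length]                    # truncate to 100 chars


print(clean_filename("a/b: c?.txt"))
print(len(clean_filename("x" * 150)))
```

Because truncation happens last, two long titles sharing their first 100 characters map to the same name, which ties in with the overwrite risk noted for tokenize_and_save.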
+
+[ get_folder_info(path) ]
+- Walks recursively → sums file sizes and counts files
+- Returns (total_size_bytes, total_files)
+- Used for metrics and for stopping at the size limit
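A recursive size/count helper with this return shape, plus the size check built on it, might look like the sketch below. The body is reconstructed from the description only. `over_limit` is a hypothetical name, and reading 50 GB as 50 * 1024**3 bytes is an assumption, since the notes do not say whether binary or decimal units are meant.

```python
import os
import tempfile

SIZE_LIMIT = 50 * 1024**3  # assumed interpretation of "50 GB"


def get_folder_info(path):
    """Return (total_size_bytes, total_files) for everything under path."""
    total_size, total_files = 0, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):  # skips broken symlinks
                total_size += os.path.getsize(file_path)
                total_files += 1
    return total_size, total_files


def over_limit(folders, limit=SIZE_LIMIT):
    """True when the combined size of the folders reaches the limit."""
    return sum(get_folder_info(folder)[0] for folder in folders) >= limit


with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "sub"))
    with open(os.path.join(base, "a.txt"), "wb") as f:
        f.write(b"abc")
    with open(os.path.join(base, "sub", "b.bin"), "wb") as f:
        f.write(b"12345")
    print(get_folder_info(base))
    print(over_limit([base], limit=8))
```

Walking both folders after every download is linear in the number of files, so for large crawls a running byte counter may be cheaper than re-scanning.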
+
+=============================================================================================================
+[ LOGGING / METRICS / LIMITS ]
+- logging.INFO per stage (download, extract, translate, clean, tokenize)
+- Size limit: 50GB (files + articles) → stops crawling
+- Final summary:
+    * Articles downloaded (# and MB)
+    * Files downloaded (# and MB)
+    * Files tokenized (# and MB)
+- Suggestions:
+    * Add retry/backoff handling to requests
+    * Cache translations by hash (saves cost/time)
+    * Controlled parallelism (queue + I/O/CPU limits)
+    * Shard tokenized/ into subfolders if the file count grows
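Of the suggestions above, the hash-keyed translation cache is the easiest to sketch. The class below is a hypothetical in-memory wrapper around any translate callable; persisting the cache to disk, which is what would actually save cost between runs, is left out.

```python
import hashlib


class TranslationCache:
    """Memoize a translate callable, keyed by the SHA-256 of the input text."""

    def __init__(self, translate):
        self._translate = translate
        self._store = {}
        self.misses = 0  # number of real translation calls made

    def __call__(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._translate(text)
        return self._store[key]


# str.upper stands in for translate_text() in this demonstration.
cached = TranslationCache(str.upper)
print(cached("hola"), cached("hola"), cached.misses)
```

Hashing keeps the keys short and of fixed size, which matters once the cache is written to a file or database rather than held in a dict.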
+
+=============================================================================================================
+[ DATA FLOW (SUMMARY) ]
+URLs  ──► register_processed_notifications ──► explore_* (HTMLSession/render/links)
+   └────────► extract_and_save_article ──► translate_text ─► clean_text ─► articulos/*.txt
+   └────────► download_and_save_file ─────────────────────────────────────► archivos/*
+archivos/* ──► process_files (read_* → translate → clean → tokenize) ──► tokenized/*
+articulos/*.txt ──► tokenize_all_articles ───────────────────────────────► tokenized/*
+tokenized/*, articulos/*, archivos/* ──► get_folder_info + logs
+
+████████████████████████████████████████████████████████████████████████████████████
--- a/FLUJOS_DATOS/NOTICIAS/docs.txt
+++ b/FLUJOS_DATOS/NOTICIAS/docs.txt
@@ -0,0 +1,111 @@
+# Project Description
+
+This project extracts, cleans, and tokenizes articles and files from a variety of web sources. The program performs the following main tasks:
+
+1. **Article extraction**: Extracts article content from the specified websites.
+2. **File download**: Downloads files in several formats: PDF, CSV, TXT, XLSX, DOCX, HTML, MD, and ZIP.
+3. **File processing**: Reads the content of the downloaded files and prepares it for tokenization.
+4. **Tokenization**: Tokenizes the content of the articles and files for later analysis.
+
+# Project Structure
+
+- `main_noticias.py`: Main script that coordinates all tasks.
+- `noticias_utils.py`: Contains the helper functions for extracting, downloading, cleaning, reading, processing, and tokenizing the files.
+- `articulos/`: Directory where extracted articles are saved.
+- `archivos/`: Directory where downloaded files are saved.
+# Required Packages
+
+For this project to work, the following Python packages must be installed (listed by the names `pip` expects; the list covers every third-party import in `main_noticias.py`):
+
+- `requests`
+- `requests-html`
+- `beautifulsoup4`
+- `deep-translator`
+- `transformers`
+- `PyPDF2`
+- `python-docx`
+- `openpyxl`
+- `html2text`
+- `tqdm`
+- `urllib3`
+
+# Commands to Install the Packages
+
+First, make sure pip is up to date:
+
+```bash
+pip install --upgrade pip
+```
+
+Then install the required packages:
+
+```bash
+pip install requests requests-html beautifulsoup4 deep-translator transformers PyPDF2 python-docx openpyxl html2text tqdm urllib3
+```
+
+# Creating and Activating the Virtual Environment
+
+A virtual environment named `myenv` already exists in the `FLUJOS_DATOS` folder; activating it avoids dependency conflicts. The steps below cover both creating and activating a virtual environment, should you need to.
+
+## Creating the Virtual Environment
+
+If you need to create a new virtual environment, run:
+
+```bash
+cd ~/PROGRAMACION/FLUJOS_TODO/FLUJOS_DATOS
+python3 -m venv myenv
+```
+
+## Activating the Virtual Environment
+
+To activate the `myenv` virtual environment, use the command for your platform.
+
+On Linux/macOS:
+
+```bash
+source ~/PROGRAMACION/FLUJOS_TODO/FLUJOS_DATOS/myenv/bin/activate
+```
+
+On Windows (cmd):
+
+```cmd
+myenv\Scripts\activate
+```
+
+On Windows (PowerShell):
+
+```powershell
+myenv\Scripts\Activate.ps1
+```
+
+Once the virtual environment is active, you can install the required packages and run the scripts.
+
+# Running the Program
+
+1. Make sure the virtual environment is activated.
+2. Go to the NOTICIAS folder:
+
+```bash
+cd ~/PROGRAMACION/FLUJOS_TODO/FLUJOS_DATOS/NOTICIAS
+```
+
+3. Run the main script:
+
+```bash
+python main_noticias.py
+```
+
+The program will extract, download, process, and tokenize the articles and files from the web sources specified in the script.
+
+# Additional Notes
+
+- Make sure the `articulos` and `archivos` folders exist before running the script.
+- You can adjust the URLs and settings in `main_noticias.py` and `noticias_utils.py` to suit your needs.
+
+# Contact
+
+For any questions or problems with the script, please contact the project administrator.
--- a/FLUJOS_DATOS/NOTICIAS/main_noticias.py
+++ b/FLUJOS_DATOS/NOTICIAS/main_noticias.py
@@ -0,0 +1,625 @@
+from deep_translator import GoogleTranslator
+import os
+import re
+import hashlib
+import requests
+import json
+import time
+import logging
+from requests_html import HTMLSession
+from bs4 import BeautifulSoup
+from PyPDF2 import PdfReader
+import csv
+import docx
+import openpyxl
+import zipfile
+import html2text
+from transformers import BertTokenizer
+from tqdm import tqdm
+from urllib.parse import urlparse, urljoin
+
+# Logging configuration so progress is shown in the terminal
+logging.basicConfig(level=logging.INFO)
+
+# Initialize the Spanish BERT tokenizer
+tokenizer = BertTokenizer.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')
+
+# Spanish stopword list
+STOPWORDS = set([
+    "de", "la", "que", "el", "en", "y", "a", "los", "del", "se", "las", "por",
+    "un", "para", "con", "no", "una", "su", "al", "es", "lo", "como", "más",
+    "pero", "sus", "le", "ya", "o", "fue", "este", "ha", "sí", "porque",
+    "esta", "son", "entre", "cuando", "muy", "sin", "sobre", "también", "me",
+    "hasta", "hay", "donde", "quien", "desde", "todo", "nos", "durante",
+    "todos", "uno", "les", "ni", "contra", "otros", "ese", "eso", "ante",
+    "ellos", "e", "esto", "mí", "antes", "algunos", "qué", "unos", "yo",
+    "otro", "otras", "otra", "él", "tanto", "esa", "estos", "mucho",
+    "quienes", "nada", "muchos", "cual", "poco", "ella", "estar", "estas",
+    "algunas", "algo", "nosotros", "mi", "mis", "tú", "te", "ti", "tu",
+    "tus", "ellas", "nosotras", "vosotros", "vosotras", "os", "mío", "mía",
+    "míos", "mías", "tuyo", "tuya", "tuyos", "tuyas", "suyo", "suya",
+    "suyos", "suyas", "nuestro", "nuestra", "nuestros", "nuestras",
+    "vuestro", "vuestra", "vuestros", "vuestras", "esos", "esas", "estoy",
+    "estás", "está", "estamos", "estáis", "están", "esté", "estés",
+    "estemos", "estéis", "estén", "estaré", "estarás", "estará",
+    "estaremos", "estaréis", "estarán", "estaría", "estarías",
+    "estaríamos", "estaríais", "estarían", "estaba", "estabas",
+    "estábamos", "estabais", "estaban", "estuve", "estuviste", "estuvo",
+    "estuvimos", "estuvisteis", "estuvieron", "estuviera", "estuvieras",
+    "estuviéramos", "estuvierais", "estuvieran", "estuviese",
+    "estuvieses", "estuviésemos", "estuvieseis", "estuviesen", "estando",
+    "estado", "estada", "estados", "estadas", "estad"
+])
+
+
+def translate_text(text):
+    """
+    Translate the full text into Spanish using deep-translator.
+    Returns the original text unchanged if the translation fails.
+    """
+    try:
+        return GoogleTranslator(source='auto', target='es').translate(text)
+    except Exception as e:
+        logging.error(f"Error translating with deep-translator: {e}")
+        return text
+
+def clean_text(text):
+    """
+    Clean the text by removing CDATA blocks (even with extra whitespace),
+    then HTML, punctuation, and stopwords.
+    """
+    # 1) Remove any CDATA variant (e.g. '<![ CDATA [ ... ]]>')
+    text = re.sub(r'<\!\[\s*CDATA\s*\[.*?\]\]>', '', text, flags=re.S)
+
+    # 2) Parse HTML and keep only the text
+    soup = BeautifulSoup(text, 'html.parser')
+    text = soup.get_text(separator=" ")
+
+    # 3) Lowercase (note: the tokenizer loaded above is the cased model,
+    #    so case information is discarded before tokenization)
+    text = text.lower()
+
+    # 4) Remove URLs
+    text = re.sub(r'http\S+', '', text)
+
+    # 5) Remove everything except Spanish letters and whitespace
+    text = re.sub(r'[^a-záéíóúñü\s]', '', text)
+
+    # 6) Collapse runs of whitespace
+    text = re.sub(r'\s+', ' ', text).strip()
+
+    # 7) Remove stopwords
+    words = text.split()
+    filtered = [w for w in words if w not in STOPWORDS]
+
+    return ' '.join(filtered)
+
+def tokenize_and_save(text, filename, destination_folder):
+    """
+    Encode the text as BERT token IDs (truncated to 512, special tokens
+    included) and write them space-separated to destination_folder/filename.
+    """
+    tokens = tokenizer.encode(
+        text,
+        truncation=True,
+        max_length=512,
+        add_special_tokens=True
+    )
+    tokens_str = ' '.join(map(str, tokens))
+
+    # Make sure the destination directory exists
+    os.makedirs(destination_folder, exist_ok=True)
+
+    # Use filename as-is for the output file
+    out_path = os.path.join(destination_folder, filename)
+
+    with open(out_path, 'w', encoding='utf-8') as f:
+        f.write(tokens_str)
+
+
+def tokenize_all_articles(articles_folder, destination_folder):
+    """
+    Tokenize every .txt article found under the given folder.
+    """
+    os.makedirs(destination_folder, exist_ok=True)
+
+    logging.info("Starting tokenization...")
+    total_articles = 0
+    total_size = 0
+
+    for root, dirs, files in os.walk(articles_folder):
+        for file in files:
+            if file.endswith('.txt'):
+                file_path = os.path.join(root, file)
+                with open(file_path, 'r', encoding='utf-8') as f:
+                    content = f.read()
+                tokenize_and_save(content, file, destination_folder)
+                total_articles += 1
+                # Size of the source article, not of the token file
+                total_size += os.path.getsize(file_path)
+
+    total_size_mb = total_size / (1024 * 1024)
+    logging.info(f"Tokenization finished for {total_articles} articles.")
+    logging.info(f"Total size of tokenized articles: {total_size_mb:.2f} MB.")
+
+def read_pdf(pdf_path):
+    """
+    Read a PDF file and extract its text, page by page.
+    """
+    content = ''
+    try:
+        with open(pdf_path, 'rb') as f:
+            pdf_reader = PdfReader(f)
+            for page in pdf_reader.pages:
+                text = page.extract_text()
+                if text:
+                    content += text + '\n'
+    except Exception as e:
+        logging.error(f"Error reading PDF {pdf_path}: {e}")
+    return content
+
+def read_csv(csv_path):
+    """
+    Read a CSV file and return its rows as space-joined lines of text.
+    """
+    content = ''
+    try:
+        # newline='' is what the csv module expects for files it reads
+        with open(csv_path, 'r', encoding='utf-8', newline='') as f:
+            reader = csv.reader(f)
+            for row in reader:
+                content += ' '.join(row) + '\n'
+    except Exception as e:
+        logging.error(f"Error reading CSV {csv_path}: {e}")
+    return content
+
+def read_docx(docx_path):
+    """
+    Read a DOCX file and return the text of its paragraphs.
+    """
+    content = ''
+    try:
+        doc = docx.Document(docx_path)
+        for paragraph in doc.paragraphs:
+            content += paragraph.text + '\n'
+    except Exception as e:
+        logging.error(f"Error reading DOCX {docx_path}: {e}")
+    return content
+
+def read_xlsx(xlsx_path):
+    """
+    Read an XLSX file and return the cell values of every sheet as text.
+    """
+    content = ''
+    try:
+        wb = openpyxl.load_workbook(xlsx_path)
+        for sheet in wb.sheetnames:
+            ws = wb[sheet]
+            for row in ws.iter_rows():
+                row_text = ' '.join([str(cell.value) if cell.value is not None else '' for cell in row])
+                content += row_text + '\n'
+    except Exception as e:
+        logging.error(f"Error reading XLSX {xlsx_path}: {e}")
+    return content
+
+def read_zip(zip_path):
+    """
+    Read every member of a ZIP file and return the contents decoded as UTF-8 text.
+    """
+    content = ''
+    try:
+        with zipfile.ZipFile(zip_path, 'r') as z:
+            for filename in z.namelist():
+                if filename.endswith('/'):
+                    continue  # skip directory entries
+                with z.open(filename) as f:
+                    file_content = f.read().decode('utf-8', errors='ignore')
+                    content += file_content + '\n'
+    except Exception as e:
+        logging.error(f"Error reading ZIP {zip_path}: {e}")
+    return content
+
+def read_html_md(file_path):
+    """
+    Read an HTML or Markdown file and return its raw contents.
+    """
+    content = ''
+    try:
+        with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
+            content = f.read()
+    except Exception as e:
+        logging.error(f"Error reading HTML/MD {file_path}: {e}")
+    return content
+
+def format_content(html_content):
+    """
+    Convert HTML content to plain text.
+    """
+    h = html2text.HTML2Text()
+    h.ignore_links = True
+    h.ignore_images = True
+    text = h.handle(html_content)
+    return text
+
+def process_files(files_folder, destination_folder):
+    """
+    Extract, translate, clean and tokenize every supported file under files_folder.
+    """
+    os.makedirs(destination_folder, exist_ok=True)
+
+    logging.info("Processing downloaded files...")
+    total_files = 0
+    total_size = 0
+
+    for root, dirs, files in os.walk(files_folder):
+        for file in files:
+            file_path = os.path.join(root, file)
+            content = ''
+            # Compare extensions case-insensitively (e.g. 'REPORT.PDF')
+            ext = os.path.splitext(file)[1].lower()
+
+            if ext == '.pdf':
+                content = read_pdf(file_path)
+            elif ext == '.csv':
+                content = read_csv(file_path)
+            elif ext == '.txt':
+                try:
+                    with open(file_path, 'r', encoding='utf-8') as f:
+                        content = f.read()
+                except Exception as e:
+                    logging.error(f"Error reading TXT {file_path}: {e}")
+            elif ext == '.docx':
+                content = read_docx(file_path)
+            elif ext == '.xlsx':
+                content = read_xlsx(file_path)
+            elif ext == '.zip':
+                content = read_zip(file_path)
+            elif ext in ('.html', '.md'):
+                content = read_html_md(file_path)
+                content = format_content(content)
+            else:
+                logging.info(f"Unsupported file format: {file}")
+                continue
+
+            if content:
+                translated_text = translate_text(content)
+                cleaned_text = clean_text(translated_text)
+                tokenize_and_save(cleaned_text, file, destination_folder)
+                total_files += 1
+                total_size += os.path.getsize(file_path)
+
+    total_size_mb = total_size / (1024 * 1024)
+    logging.info(f"Processing completed for {total_files} files.")
+    logging.info(f"Total size of processed files: {total_size_mb:.2f} MB.")
+
+def download_and_save_file(url, destination_folder):
+    """
+    Download the file at url and save it in destination_folder.
+    """
+    try:
+        logging.info(f"Downloading file: {url}")
+        # With stream=True the connection stays open until the body is consumed,
+        # so the response is used as a context manager to release it.
+        with requests.get(url, stream=True, timeout=30) as response:
+            if response.status_code == 200:
+                # Name the file after the URL path only, without the query string
+                filename = clean_filename(os.path.basename(urlparse(url).path))
+                if not filename:
+                    filename = 'archivo_descargado'
+                file_path = os.path.join(destination_folder, filename)
+                with open(file_path, 'wb') as f:
+                    for chunk in response.iter_content(chunk_size=8192):
+                        f.write(chunk)
+                logging.info(f"File downloaded: {file_path}")
+            else:
+                logging.warning(f"Failed to download {url}: status code {response.status_code}")
+    except Exception as e:
+        logging.error(f"Error downloading {url}: {e}")
+
+def extract_and_save_article(url, articles_folder):
+    """
+    Fetch url, extract the text of its <p> elements and save it as a .txt article.
+    """
+    try:
+        logging.info(f"Extracting article: {url}")
+        response = requests.get(url, timeout=30)
+        if response.status_code == 200:
+            soup = BeautifulSoup(response.content, 'html.parser')
+            title_tag = soup.find('title')
+            title = title_tag.get_text().strip() if title_tag else None
+            paragraphs = soup.find_all('p')
+            content = ' '.join([para.get_text() for para in paragraphs])
+
+            if content.strip():
+                translated_text = translate_text(content)
+                cleaned_text = clean_text(translated_text)
+                if title:
+                    filename = clean_filename(title) + '.txt'
+                else:
+                    # URLs ending in '/' have an empty last path segment,
+                    # so strip the slash and fall back to a default name
+                    slug = urlparse(url).path.rstrip('/').split('/')[-1]
+                    filename = (clean_filename(slug) or 'sin_nombre') + '.txt'
+
+                file_path = os.path.join(articles_folder, filename)
+
+                with open(file_path, 'w', encoding='utf-8') as f:
+                    f.write(cleaned_text)
+
+                logging.info(f"Article saved: {file_path}")
+            else:
+                logging.info(f"No content found at {url}")
+        else:
+            logging.warning(f"Failed to access {url}: status code {response.status_code}")
+    except Exception as e:
+        logging.error(f"Error extracting article from {url}: {e}")
+
+def get_page_title(url):
+    """
+    Return the <title> text of the page at url, or None if unavailable.
+    """
+    try:
+        response = requests.get(url, timeout=10)
+        if response.status_code == 200:
+            soup = BeautifulSoup(response.content, 'html.parser')
+            title_tag = soup.find('title')
+            return title_tag.get_text().strip() if title_tag else None
+        else:
+            return None
+    except Exception as e:
+        logging.error(f"Error getting the page title for {url}: {e}")
+        return None
+
+def clean_filename(name):
+    """
+    Sanitize a file name: replace forbidden characters and whitespace, cap at 100 characters.
+    """
+    if name is None:
+        return 'sin_nombre'
+    name = re.sub(r'[\\/*?:"<>|]', "_", name)
+    name = re.sub(r'\s+', '_', name)
+    return name[:100]
+
+def register_processed_notifications(base_folder, urls):
+    """
+    Record seed URLs in processed_articles.txt and return those not seen before.
+
+    Note: URLs are written to the file before they are crawled, so a run
+    that fails midway will not retry them on the next run.
+    """
+    os.makedirs(base_folder, exist_ok=True)
+
+    txt_path = os.path.join(base_folder, "processed_articles.txt")
+    processed_urls = set()
+
+    if os.path.exists(txt_path):
+        with open(txt_path, 'r', encoding='utf-8') as f:
+            processed_urls = set(f.read().splitlines())
+
+    # dict.fromkeys drops repeated entries in urls while keeping their order
+    urls_to_process = [url for url in dict.fromkeys(urls) if url not in processed_urls]
+
+    with open(txt_path, 'a', encoding='utf-8') as f:
+        for url in urls_to_process:
+            f.write(url + "\n")
+
+    if processed_urls:
+        logging.info(f"URLs already processed: {len(processed_urls)}")
+    else:
+        logging.info("No previously processed URLs.")
+
+    return urls_to_process
+
+def explore_wayback_machine(url, articles_folder):
+    """
+    Query the Wayback Machine for the closest archived snapshot of url and extract it.
+    """
+    try:
+        logging.info(f"Exploring Wayback Machine for: {url}")
+        # Pass the target as a query parameter so requests URL-encodes it
+        api_url = "https://archive.org/wayback/available"
+        response = requests.get(api_url, params={'url': url}, timeout=10)
+        data = response.json()
+
+        if 'archived_snapshots' in data and 'closest' in data['archived_snapshots']:
+            archive_url = data['archived_snapshots']['closest']['url']
+            logging.info(f"Downloading from Wayback Machine: {archive_url}")
+            extract_and_save_article(archive_url, articles_folder)
+        else:
+            logging.info(f"No archived version found for {url}")
+    except Exception as e:
+        logging.error(f"Error exploring Wayback Machine for {url}: {e}")
+
+def get_folder_info(path):
+    """
+    Return (total size in bytes, number of files) for the folder tree at path.
+    """
+    total_size = 0
+    total_files = 0
+    for dirpath, dirnames, filenames in os.walk(path):
+        for f in filenames:
+            fp = os.path.join(dirpath, f)
+            total_size += os.path.getsize(fp)
+            total_files += 1
+    return total_size, total_files
+
+def explore_and_extract_articles(url, articles_folder, files_folder, processed_urls, size_limit, depth=0, max_depth=6):
+    """
+    Recursively crawl url, saving linked files and articles until max_depth or size_limit is reached.
+    """
+    if depth > max_depth:
+        return
+
+    logging.info(f"Exploring {url} at depth {depth}...")
+    session = HTMLSession()
+    try:
+        response = session.get(url, timeout=30)
+        response.html.render(timeout=30, sleep=1)
+        links = response.html.absolute_links
+    except Exception as e:
+        logging.error(f"Error accessing {url}: {e}")
+        return
+    finally:
+        # Close the session on both the success and the error path
+        session.close()
+
+    for link in links:
+        if link in processed_urls:
+            continue
+
+        processed_urls.add(link)
+
+        parsed_link = urlparse(link)
+        file_extension = os.path.splitext(parsed_link.path)[1].lower()
+
+        if file_extension in ['.pdf', '.csv', '.txt', '.xlsx', '.docx', '.html', '.md', '.zip']:
+            download_and_save_file(link, files_folder)
+        elif 'mailto:' in link or 'tel:' in link:
+            continue
+        else:
+            extract_and_save_article(link, articles_folder)
+            explore_and_extract_articles(link, articles_folder, files_folder, processed_urls, size_limit, depth + 1, max_depth)
+
+        total_size_articles, _ = get_folder_info(articles_folder)
+        total_size_files, _ = get_folder_info(files_folder)
+        total_size = total_size_articles + total_size_files
+
+        if total_size >= size_limit:
+            # Report the limit actually passed in rather than a hard-coded figure
+            logging.info(f"Size limit of {size_limit / 1024 ** 3:.0f} GB reached. Stopping exploration.")
+            return
+
+def main():
+    logging.info("Function: main")
+
+    urls = [
+        'https://reactionary.international/database/',
+        'https://aleph.occrp.org/',
+        'https://offshoreleaks.icij.org/',
+        'https://www.publico.es/',
+        'https://www.elsaltodiario.com/',
+        'https://www.nytimes.com/',
+        'https://www.theguardian.com/',
+        'https://www.lemonde.fr/',
+        'https://www.spiegel.de/',
+        'https://elpais.com/',
+        'https://www.repubblica.it/',
+        'https://www.scmp.com/',
+        'https://www.smh.com.au/',
+        'https://www.globo.com/',
+        'https://timesofindia.indiatimes.com/',
+        'https://www.asahi.com/',
+        'https://www.washingtonpost.com/',
+        'https://www.aljazeera.com/',
+        'https://www.folha.uol.com.br/',
+        'https://www.telegraph.co.uk/',
+        'https://www.corriere.it/',
+        'https://www.clarin.com/',
+        'https://www.eluniversal.com.mx/',
+        'https://www.welt.de/',
+        'https://www.lanacion.com.ar/',
+        'https://www.bbc.com/',
+        'https://www.elconfidencial.com/',
+        'https://www.expansion.com/',
+        'https://www.lavanguardia.com/',
+        'https://www.elperiodico.com/',
+        'https://www.abc.es/',
+        'https://www.elespanol.com/',
+        'https://www.lainformacion.com/',
+        'https://www.elcorreo.com/',
+        'https://www.canarias7.es/',
+        'https://www.diariovasco.com/',
+        'https://www.farodevigo.es/',
+        'https://www.lavozdegalicia.es/',
+        'https://www.marca.com/',
+        'https://www.mundodeportivo.com/',
+        'https://www.elmundo.es/',
+        'https://www.cnbc.com/',
+        'https://www.bloomberg.com/',
+        'https://www.forbes.com/',
+        'https://www.economist.com/',
+        'https://www.ft.com/',
+        'https://www.wsj.com/',
+        'https://www.technologyreview.com/',
+        'https://www.cyberdefensemagazine.com/',
+        'https://www.securityweek.com/',
+        'https://www.darkreading.com/',
+        'https://www.infosecurity-magazine.com/',
+        'https://www.helpnetsecurity.com/',
+        'https://www.computerweekly.com/',
+        'https://www.csoonline.com/',
+        'https://www.zdnet.com/',
+        'https://www.itpro.co.uk/',
+        'https://www.theregister.com/',
+        'https://www.datacenterdynamics.com/',
+        'https://www.scmagazine.com/',
+        'https://www.teiss.co.uk/',
+        'https://www.tripwire.com/',
+        'https://www.infoworld.com/',
+        'https://www.cnet.com/',
+        'https://www.tomsguide.com/',
+        'https://www.theverge.com/',
+        'https://www.arstechnica.com/',
+        'https://www.engadget.com/',
+        'https://www.gizmodo.com/',
+        'https://www.wired.com/',
+        'https://www.vice.com/',
+        'https://www.politico.com/',
+        'https://www.theatlantic.com/',
+        'https://www.newyorker.com/',
+        'https://www.rollingstone.com/',
+        'https://www.thedailybeast.com/',
+        'https://www.salon.com/',
+        'https://www.slate.com/',
+        'https://www.huffpost.com/',
+        'https://www.vox.com/',
+        'https://www.bbc.co.uk/news',
+        'https://www.dailymail.co.uk/home/index.html',
+        'https://www.independent.co.uk/',
+        'https://www.irishtimes.com/',
+        'https://www.thejournal.ie/',
+        'https://www.thetimes.co.uk/',
+        'https://www.thesun.co.uk/',
+        'https://www.euronews.com/',
+        'https://www.reuters.com/',
+        'https://www.dw.com/',
+        'https://www.france24.com/',
+        'https://www.lefigaro.fr/',
+        'https://www.derstandard.at/',
+        'https://www.nzz.ch/',
+        'https://www.eldiario.es/',
+        'https://www.rtve.es/',
+        'https://www.rt.com/',
+        'https://www.elciudadano.com/',
+        'https://www.apnews.com/',
+        'https://www.univision.com/',
+        'https://www.televisa.com/',
+        'https://www.cnn.com/',
+        'https://www.foxnews.com/',
+        'https://www.trtworld.com/',
+        'https://www.newsweek.com/',
+        'https://www.time.com/',
+        'https://www.spectator.co.uk/'
+    ]
+
+    base_folder = '/var/www/theflows.net/flujos/FLUJOS_DATOS/NOTICIAS'
+    articles_folder = os.path.join(base_folder, 'articulos')
+    files_folder = os.path.join(base_folder, 'archivos')
+    tokenized_folder = os.path.join(base_folder, 'tokenized')
+
+    for folder in [articles_folder, files_folder, tokenized_folder]:
+        os.makedirs(folder, exist_ok=True)
+
+    FOLDER_SIZE_LIMIT = 50 * 1024 * 1024 * 1024  # 50 GB
+
+    urls_to_process = register_processed_notifications(base_folder, urls)
+    processed_urls = set()
+
+    for url in urls_to_process:
+        logging.info(f"Processing URL: {url}")
+        explore_and_extract_articles(url, articles_folder, files_folder, processed_urls, FOLDER_SIZE_LIMIT)
+        explore_wayback_machine(url, articles_folder)
+
+    process_files(files_folder, tokenized_folder)
+    tokenize_all_articles(articles_folder, tokenized_folder)
+
+    total_size_articles, total_files_articles = get_folder_info(articles_folder)
+    total_size_files, total_files_files = get_folder_info(files_folder)
+    total_size_tokenized, total_files_tokenized = get_folder_info(tokenized_folder)
+
+    logging.info("Process summary:")
+    logging.info(f"Articles downloaded: {total_files_articles}")
+    logging.info(f"Total size of articles: {total_size_articles / (1024 * 1024):.2f} MB")
+    logging.info(f"Files downloaded: {total_files_files}")
+    logging.info(f"Total size of files: {total_size_files / (1024 * 1024):.2f} MB")
+    logging.info(f"Tokenized files: {total_files_tokenized}")
+    logging.info(f"Total size of tokenized files: {total_size_tokenized / (1024 * 1024):.2f} MB.")
+
+if __name__ == "__main__":
+    main()
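
Note on `explore_wayback_machine` above: it reads the `archived_snapshots` / `closest` / `url` fields from the JSON returned by the Wayback Machine availability endpoint. As a minimal sketch of that lookup in isolation, the helper below returns the snapshot URL or `None`. The name `closest_snapshot_url` is hypothetical and not part of the script, and the sample payloads are hand-written for illustration rather than captured API output. Unlike the script, the sketch also checks the `available` flag before trusting the entry.

```python
def closest_snapshot_url(data):
    """Return the closest archived snapshot URL from an availability
    response dict, or None when no usable snapshot is listed."""
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available", False):
        return None
    return closest.get("url")


# Hand-written sample payloads (illustrative, not real API output).
found = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20240101000000",
            "url": "http://web.archive.org/web/20240101000000/https://example.com/",
        }
    }
}
missing = {"archived_snapshots": {}}

print(closest_snapshot_url(found))
print(closest_snapshot_url(missing))
```

Keeping the parsing in a pure function like this would let it be checked without network access; the request itself stays where it is in the script.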
--- a/FLUJOS_DATOS/NOTICIAS/noticias_procesadas.txt
+++ b/FLUJOS_DATOS/NOTICIAS/noticias_procesadas.txt
@ -0,0 +1,5 @@
+https://aleph.occrp.org/
+https://offshoreleaks.icij.org/
+https://reactionary.international/database/
+https://www.elsaltodiario.com/
+https://www.publico.es/
--- a/FLUJOS_DATOS/NOTICIAS/processed_articles.txt
+++ b/FLUJOS_DATOS/NOTICIAS/processed_articles.txt
@ -0,0 +1,196 @@
+https://reactionary.international/database/
+https://aleph.occrp.org/
+https://offshoreleaks.icij.org/
+https://www.publico.es/
+https://www.elsaltodiario.com/
+https://www.nytimes.com/
+https://www.theguardian.com/
+https://www.lemonde.fr/
+https://www.spiegel.de/
+https://elpais.com/
+https://www.repubblica.it/
+https://www.scmp.com/
+https://www.smh.com.au/
+https://www.globo.com/
+https://timesofindia.indiatimes.com/
+https://www.asahi.com/
+https://www.washingtonpost.com/
+https://www.aljazeera.com/
+https://www.folha.uol.com.br/
+https://www.telegraph.co.uk/
+https://www.corriere.it/
+https://www.clarin.com/
+https://www.eluniversal.com.mx/
+https://www.welt.de/
+https://www.lanacion.com.ar/
+https://www.bbc.com/
+https://www.elconfidencial.com/
+https://www.expansion.com/
+https://www.lavanguardia.com/
+https://www.elperiodico.com/
+https://www.abc.es/
+https://www.elespanol.com/
+https://www.lainformacion.com/
+https://www.elcorreo.com/
+https://www.canarias7.es/
+https://www.diariovasco.com/
+https://www.farodevigo.es/
+https://www.lavozdegalicia.es/
+https://www.marca.com/
+https://www.mundodeportivo.com/
+https://www.elmundo.es/
+https://www.wired.com/
+https://www.techcrunch.com/
+https://www.cybersecurity-insiders.com/
+https://www.darkreading.com/
+https://www.hackread.com/
+https://www.theregister.com/
+https://www.csoonline.com/
+https://www.scmagazine.com/
+https://www.securityweek.com/
+https://www.infosecurity-magazine.com/
+https://www.hackaday.com/
+https://www.economist.com/
+https://www.ft.com/
+https://www.bloomberg.com/
+https://www.wsj.com/
+https://www.forbes.com/
+https://www.businessinsider.com/
+https://www.reuters.com/
+https://www.cnbc.com/
+https://www.nbcnews.com/
+https://www.cbsnews.com/
+https://www.abcnews.go.com/
+https://www.vox.com/
+https://www.politico.com/
+https://www.euronews.com/
+https://www.france24.com/
+https://www.rt.com/
+https://www.al-monitor.com/
+https://www.jpost.com/
+https://www.haaretz.com/
+https://www.middleeasteye.net/
+https://www.indiatoday.in/
+https://www.chinadaily.com.cn/
+https://www.japantimes.co.jp/
+https://www.koreatimes.co.kr/
+https://www.thehindu.com/
+https://www.nikkei.com/
+https://www.manilatimes.net/
+https://www.bangkokpost.com/
+https://www.theaustralian.com.au/
+https://www.nzherald.co.nz/
+https://www.theglobeandmail.com/
+https://www.torontostar.com/
+https://www.ctvnews.ca/
+https://www.globalnews.ca/
+https://www.thehill.com/
+https://www.breitbart.com/
+https://www.nationalreview.com/
+https://www.slate.com/
+https://www.newyorker.com/
+https://www.atlanticcouncil.org/
+https://www.chathamhouse.org/
+https://www.rand.org/
+https://www.cfr.org/
+https://www.brookings.edu/
+https://www.carnegieendowment.org/
+https://www.wilsoncenter.org/
+https://www.hoover.org/
+https://www.csis.org/
+https://www.heritage.org/
+https://www.aspi.org.au/
+https://www.iiss.org/
+https://www.rusi.org/
+https://www.intelligenceonline.com/
+https://www.sit.kb.gov.tr/
+https://www.securitymagazine.com/
+https://www.zdnet.com/
+https://www.helpnetsecurity.com/
+https://www.bankinfosecurity.com/
+https://www.nsa.gov/
+https://www.fbi.gov/
+https://www.mi5.gov.uk/
+https://www.mi6.gov.uk/
+https://www.mss.gov.cn/
+https://www.bnd.bund.de/
+https://www.cni.es/
+https://www.cis.es/
+https://www.dni.gov/
+https://www.mossad.gov.il/
+https://www.afp.gov.au/
+https://www.royalnavy.mod.uk/
+https://www.gov.uk/government/organisations/foreign-commonwealth-office
+https://www.cabinetoffice.gov.uk/
+https://www.janes.com/
+https://www.gov.uk/government/organisations/defence-intelligence
+https://www.nato.int/
+https://www.un.org/en/
+https://www.worldbank.org/
+https://www.imf.org/
+https://www.weforum.org/
+https://www.oecd.org/
+https://www.wto.org/
+https://www.unesco.org/
+https://www.who.int/
+https://www.icc-cpi.int/
+https://www.eurojust.europa.eu/
+https://www.europol.europa.eu/
+https://www.dia.mil/
+https://www.nro.gov/
+https://www.cia.gov/
+https://www.sis.gov.uk/
+https://www.interpol.int/
+https://www.intel.gov/
+https://www.financialtimes.com/
+https://www.wallstreetjournal.com/
+https://www.fortune.com/
+https://www.marketwatch.com/
+https://www.barrons.com/
+https://www.nasdaq.com/
+https://www.sec.gov/
+https://www.nyse.com/
+https://www.isda.org/
+https://www.technologyreview.com/
+https://www.cyberdefensemagazine.com/
+https://www.computerweekly.com/
+https://www.itpro.co.uk/
+https://www.datacenterdynamics.com/
+https://www.teiss.co.uk/
+https://www.tripwire.com/
+https://www.infoworld.com/
+https://www.cnet.com/
+https://www.tomsguide.com/
+https://www.theverge.com/
+https://www.arstechnica.com/
+https://www.engadget.com/
+https://www.gizmodo.com/
+https://www.vice.com/
+https://www.theatlantic.com/
+https://www.rollingstone.com/
+https://www.thedailybeast.com/
+https://www.salon.com/
+https://www.huffpost.com/
+https://www.bbc.co.uk/news
+https://www.dailymail.co.uk/home/index.html
+https://www.independent.co.uk/
+https://www.irishtimes.com/
+https://www.thejournal.ie/
+https://www.thetimes.co.uk/
+https://www.thesun.co.uk/
+https://www.dw.com/
+https://www.lefigaro.fr/
+https://www.derstandard.at/
+https://www.nzz.ch/
+https://www.eldiario.es/
+https://www.rtve.es/
+https://www.elciudadano.com/
+https://www.apnews.com/
+https://www.univision.com/
+https://www.televisa.com/
+https://www.cnn.com/
+https://www.foxnews.com/
+https://www.trtworld.com/
+https://www.newsweek.com/
+https://www.time.com/
+https://www.spectator.co.uk/
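
Note on processed_articles.txt above: this is the ledger that `register_processed_notifications` reads to decide which seed URLs to skip. A minimal sketch of that filtering step follows, assuming one URL per line. The helper name `pending_urls` is hypothetical, and the example.com / example.org URLs are placeholders. Unlike the script's function, the sketch only reads the ledger and never appends to it, and it drops repeated seeds while keeping their order.

```python
import os
import tempfile


def pending_urls(ledger_path, urls):
    """Return the URLs not yet listed in the ledger file, in their
    original order and without repeats."""
    seen = set()
    if os.path.exists(ledger_path):
        with open(ledger_path, "r", encoding="utf-8") as f:
            seen = {line.strip() for line in f if line.strip()}
    # dict.fromkeys keeps the first occurrence of each URL, preserving order
    return [u for u in dict.fromkeys(urls) if u not in seen]


# Usage with a throwaway ledger file
with tempfile.TemporaryDirectory() as tmp:
    ledger = os.path.join(tmp, "processed_articles.txt")
    with open(ledger, "w", encoding="utf-8") as f:
        f.write("https://example.com/\n")
    seeds = ["https://example.com/", "https://example.org/", "https://example.org/"]
    result = pending_urls(ledger, seeds)
    print(result)
```

Separating the read-only filter from the append step would make it possible to record a URL only after it has actually been crawled, which is a design choice the script does not currently make.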