I'm trying to sort a list of values based on similar substrings. I would like to group them into a dict of lists, where each key is the shared substring and each value is the list of strings that contain it.

For example (the real list has 24k entries):

test_list = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
        'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

Into:

resultdict = { 
'Doghouse' : ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna'],
'by KatSkill' : [ 'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill' ]
}

I tried the following, but it doesn't work at all.

from itertools import groupby 
test_list = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
            'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']


res = [list(i) for j, i in groupby(test_list, 
                          lambda a: a.partition('_')[0])]
Dammas Groen Oct 3, 2019 at 16:37

3 answers

Best answer

First, find all substrings (separated by " ") that also occur in another string of the input list. In the process, build a dictionary with each such substring as a key and the matching input strings as its value. This yields a dictionary whose keys are single words only. For the example, it returns:

{'by': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'], 'KatSkill': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'], 'Doghouse': ['Doghouse Antwerp', 'Doghouse Vienna', 'Doghouse Amsterdam']}

To get the desired result, a compaction step is required. For the compaction, it helps to exploit the fact that every dictionary key is also a word in the strings stored in the dictionary's lists. So iterate over the dictionary values and split each string into substrings again. Then walk through the substrings in order and find the runs of consecutive substrings that are dictionary keys. Join each run into a new key and add the string to a new dict under that key. For 24k entries, this may take a while. See the source code below:

mylist = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
        'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

def findSimilarSubstrings(strings):
    res_dict = {}
    for string in strings:
        substrings = string.split(" ")
        for otherstring in strings:
            # Prevent check with the same string
            if otherstring == string:
                continue
            for substring in substrings:
                if substring in otherstring:
                    # Create the list the first time a substring is seen
                    matches = res_dict.setdefault(substring, [])
                    # Prevent duplicates
                    if otherstring not in matches:
                        matches.append(otherstring)
    return res_dict

def findOverlappingLists(substring_dict):
    res_dict = {}
    for strings in substring_dict.values():
        for string in strings:
            substrings = string.split(" ")
            lastIndex = 0
            lastKeyInDict = False
            for i, substring in enumerate(substrings):
                if substring in substring_dict:
                    # Start of a run of consecutive key words
                    if not lastKeyInDict:
                        lastIndex = i
                        lastKeyInDict = True
                elif lastKeyInDict:
                    # The run has ended: join it into one key string
                    commonstring = " ".join(substrings[lastIndex:i])
                    # Add key string to res_dict
                    group = res_dict.setdefault(commonstring, [])
                    # Prevent duplicates
                    if string not in group:
                        group.append(string)
                    lastKeyInDict = False
            # Handle a run that reaches the last substring
            if lastKeyInDict:
                commonstring = " ".join(substrings[lastIndex:])
                group = res_dict.setdefault(commonstring, [])
                if string not in group:
                    group.append(string)
    return res_dict

# Initially find all the substrings (separated by " ") returning:
# {'by': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
#  'KatSkill': ['Garden by KatSkill', 'Meadow by KatSkill', 'House by KatSkill'],
#  'Doghouse': ['Doghouse Antwerp', 'Doghouse Vienna', 'Doghouse Amsterdam']}
similiarStrings = findSimilarSubstrings(mylist)
# Perform a compaction on similiarStrings.values() by lookup in the dictionary's key set
resultdict = findOverlappingLists(similiarStrings)
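
On the "may take a while" point: findSimilarSubstrings compares every string with every other string, which for 24k entries is roughly 576 million pairs. A possible way to cut this down, sketched below under assumptions, is to build a word index in a single pass. Note that find_shared_words is a hypothetical helper, not part of the code above, and that it matches whole words only, whereas the original uses substring containment, so the two can differ on other data. Its result has the same shape (word mapped to a list of strings), so it could in principle be passed to findOverlappingLists.

```python
from collections import defaultdict

def find_shared_words(strings):
    """Map each word to the strings that contain it (hypothetical helper).

    Only words that occur in at least two strings are kept.
    Assumption: matching is on whole words, not on arbitrary substrings.
    """
    index = defaultdict(list)
    for s in strings:
        # dict.fromkeys drops repeated words while keeping their order,
        # so a string is listed at most once per word
        for word in dict.fromkeys(s.split(" ")):
            index[word].append(s)
    return {word: members for word, members in index.items() if len(members) > 1}

mylist = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
          'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
shared = find_shared_words(mylist)
```

Each string is split once and each word is looked up once, so the work grows with the total number of words rather than with the number of string pairs.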
Draft25 Oct 3, 2019 at 16:51

Here is a perhaps simpler / faster implementation

from collections import Counter
from itertools import groupby
import pprint

# Strategy:
# 1.  Find common words in strings in list
# 2.  Group strings which have the same common words together

def find_common_words(lst):
  " Find words that appear in more than one string "
  cnt = Counter()
  for s in lst:
    # Count each word once per string, so a word repeated inside a
    # single string is not mistaken for a shared word
    cnt.update(set(s.split()))

  # return words which appear in more than one string
  return {k for k, v in cnt.items() if v > 1}

def grouping_key(s, words):
  " Key function for grouping strings with common words in the same sequence "
  return ' '.join(w for w in s.split() if w in words)

def calc_groupings(lst):
  " Generate the string groups based upon common words "
  common_words = find_common_words(lst)
  key = lambda x: grouping_key(x, common_words)

  # groupby only merges adjacent items, so sort by the key first
  # (sorted is stable, so strings keep their order within a group)
  g = groupby(sorted(lst, key=key), key)

  # Result
  return {k: list(v) for k, v in g}

t = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
        'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(calc_groupings(t))

Output

{   'Doghouse': ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna'],
    'by KatSkill': [   'House by KatSkill',
                       'Garden by KatSkill',
                       'Meadow by KatSkill']}
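
One thing worth keeping in mind with itertools.groupby: it only merges consecutive items that share a key, so input that is not already ordered by key should be sorted by that key first. The example list happens to be ordered, but a 24k-entry list may not be. A minimal sketch with hypothetical data and an illustrative key function (first_word is not part of the answer above):

```python
from itertools import groupby

# Hypothetical data: the two 'Doghouse' entries are not adjacent
names = ['Doghouse Amsterdam', 'House by KatSkill', 'Doghouse Antwerp']

def first_word(s):
    # Illustrative key function: group by the first word only
    return s.split()[0]

# Without sorting, 'Doghouse' shows up as two separate groups
unsorted_groups = [(k, list(g)) for k, g in groupby(names, first_word)]

# Sorting by the same key first makes equal keys adjacent
sorted_groups = [(k, list(g))
                 for k, g in groupby(sorted(names, key=first_word), first_word)]

print(unsorted_groups)
print(sorted_groups)
```

If the unsorted groups were collected into a dict, the second 'Doghouse' group would silently overwrite the first one.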
DarrylG Oct 5, 2019 at 19:33
mylist = [ 'Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna', 
            'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
test = ['Doghouse', 'by KatSkill']

Use a dict and list comprehension (this assumes the keys in test are known in advance):

res = { i: [j for j in mylist if i in j] for i in test}

Or set up your dict and use a loop with a list comprehension

resultdict = {}
for i in test:
    resultdict[i] = [j for j in mylist if i in j]
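
Note that `i in j` is plain substring containment, so a key such as 'by KatSkill' would also match a hypothetical entry like 'Derby KatSkill'. If that matters for the real data, one option is to compare whole words instead. The sketch below assumes words are separated by whitespace; contains_phrase is a hypothetical helper, not something from the answers here.

```python
def contains_phrase(text, phrase):
    """True if the words of phrase occur as consecutive whole words in text."""
    words = text.split()
    target = phrase.split()
    n = len(target)
    # Slide a window of n words over the text and compare word lists
    return any(words[k:k + n] == target for k in range(len(words) - n + 1))

mylist = ['Doghouse Amsterdam', 'Doghouse Antwerp', 'Doghouse Vienna',
          'House by KatSkill', 'Garden by KatSkill', 'Meadow by KatSkill']
test = ['Doghouse', 'by KatSkill']

# Same dict comprehension as above, with the whole-word test swapped in
res = {key: [s for s in mylist if contains_phrase(s, key)] for key in test}
```

For the example list this gives the same grouping as the plain `in` test; the two only differ when a key appears inside a longer word.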
Craicerjack Oct 3, 2019 at 14:05