I have a JSON string:

>>> a = '[{\\\"pic\\\": \\\"QmdYSopPxh46rQ5MjyMK5uw2sBKYVwjUNVoyKFYHb1cR97\\\", \\\"note\\\": \\\"\\\\u8aaa\\\\u660e1\\\", \\\"location\\\": \\\"\\\\u6c34\\\\u6c60\\\"}, {\\\"pic\\\": \\\"QmdNGrc1S9paXycnH7ogdB8w7qDUcWnEFJMPe1Wfb9fYyP\\\", \\\"note\\\": \\\"\\\\u8aaa\\\\u660e2\\\", \\\"location\\\": \\\"\\\\u6a4b\\\\u6a11\\\"}]'
>>> type(a)
<class 'str'>

I would like to remove the \\ but still keep the Unicode escape sequences, and eventually use json.loads to convert the string into a Python dict/list. How can I do that?

I tried three methods, but none of them worked:

  1. a.replace('\\', '')

    This removes the '\', but my Unicode notation is lost along with it.

    >>> a.replace('\\', '')  # result seems OK but lost the unicode notation
    '[{"pic": "QmdYSopPxh46rQ5MjyMK5uw2sBKYVwjUNVoyKFYHb1cR97", "note": "u8aaau660e1", "location": "u6c34u6c60"}, {"pic": "QmdNGrc1S9paXycnH7ogdB8w7qDUcWnEFJMPe1Wfb9fYyP", "note": "u8aaau660e2", "location": "u6a4bu6a11"}]'
    
  2. json.loads(a) raised an error

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
        return _default_decoder.decode(s)
      File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 353, in raw_decode
        obj, end = self.scan_once(s, idx)
    json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
    
  3. a.decode('utf-8')

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'decode'
    
SamTT Dec 9, 2019 at 05:37

2 answers

If you only need to remove the backslashes and keep the Unicode escapes:

import re

a = '[{\\\"pic\\\": \\\"QmdYSopPxh46rQ5MjyMK5uw2sBKYVwjUNVoyKFYHb1cR97\\\", \\\"note\\\": \\\"\\\\u8aaa\\\\u660e1\\\", \\\"location\\\": \\\"\\\\u6c34\\\\u6c60\\\"}, {\\\"pic\\\": \\\"QmdNGrc1S9paXycnH7ogdB8w7qDUcWnEFJMPe1Wfb9fYyP\\\", \\\"note\\\": \\\"\\\\u8aaa\\\\u660e2\\\", \\\"location\\\": \\\"\\\\u6a4b\\\\u6a11\\\"}]'
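print(a)
print('\n')

# Turn each escaped quote \" into a plain quote
b = re.sub(r'\\"', '"', a)
# Collapse the doubled backslash before u, so \\uXXXX becomes \uXXXX
b = re.sub(r'\\\\u', r'\\u', b)
print(b)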

This gives:

[{\"pic\": \"QmdYSopPxh46rQ5MjyMK5uw2sBKYVwjUNVoyKFYHb1cR97\", \"note\": \"\\u8aaa\\u660e1\", \"location\": \"\\u6c34\\u6c60\"}, {\"pic\": \"QmdNGrc1S9paXycnH7ogdB8w7qDUcWnEFJMPe1Wfb9fYyP\", \"note\": \"\\u8aaa\\u660e2\", \"location\": \"\\u6a4b\\u6a11\"}]

[{"pic": "QmdYSopPxh46rQ5MjyMK5uw2sBKYVwjUNVoyKFYHb1cR97", "note": "\u8aaa\u660e1", "location": "\u6c34\u6c60"}, {"pic": "QmdNGrc1S9paXycnH7ogdB8w7qDUcWnEFJMPe1Wfb9fYyP", "note": "\u8aaa\u660e2", "location": "\u6a4b\u6a11"}]
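The backslash counts are easy to misread, so here is a minimal sketch of the two substitutions on a short, made-up sample (not the original data) that is escaped the same way. It assumes nothing beyond the standard re module.

```python
import re

# Hypothetical short sample, escaped the same way as the string above.
# Its real characters are: {\"note\": \"\\u8aaa\"}
sample = '{\\"note\\": \\"\\\\u8aaa\\"}'
print(sample)

# Step 1: each \" becomes a plain "
step1 = re.sub(r'\\"', '"', sample)
print(step1)

# Step 2: the doubled backslash before u becomes a single one.
# In a re.sub replacement, \\ stands for one literal backslash.
step2 = re.sub(r'\\\\u', r'\\u', step1)
print(step2)
```

After the second step, the only backslash left is the one that belongs to the Unicode escape.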

If you need to work with the data later, you can also split the array and parse each dictionary on its own (json.loads accepts the cleaned string as a whole too, since a top-level array is valid JSON). I would do it like this:

import json
import re

a = '[{\\\"pic\\\": \\\"QmdYSopPxh46rQ5MjyMK5uw2sBKYVwjUNVoyKFYHb1cR97\\\", \\\"note\\\": \\\"\\\\u8aaa\\\\u660e1\\\", \\\"location\\\": \\\"\\\\u6c34\\\\u6c60\\\"}, {\\\"pic\\\": \\\"QmdNGrc1S9paXycnH7ogdB8w7qDUcWnEFJMPe1Wfb9fYyP\\\", \\\"note\\\": \\\"\\\\u8aaa\\\\u660e2\\\", \\\"location\\\": \\\"\\\\u6a4b\\\\u6a11\\\"}]'
print(a)

dictionaries = []

# Split the array into one chunk per dictionary
substrings_for_dictionaries = a.split('}, {')

for substring in substrings_for_dictionaries:
    # Strip the braces and brackets left over after the split
    substring = re.sub(r'[{}]', '', substring)
    substring = re.sub(r'[\[\]]', '', substring)
    # Unescape the quotes and the Unicode escapes
    substring = re.sub(r'\\"', '"', substring)
    substring = re.sub(r'\\\\u', r'\\u', substring)
    # Put the braces back so that each chunk is a valid JSON object
    substring = '{' + substring + '}'
    dictionary = json.loads(substring)
    dictionaries.append(dictionary)


for dictionary in dictionaries:
    print(dictionary)

As a result, it gives:

[{\"pic\": \"QmdYSopPxh46rQ5MjyMK5uw2sBKYVwjUNVoyKFYHb1cR97\", \"note\": \"\\u8aaa\\u660e1\", \"location\": \"\\u6c34\\u6c60\"}, {\"pic\": \"QmdNGrc1S9paXycnH7ogdB8w7qDUcWnEFJMPe1Wfb9fYyP\", \"note\": \"\\u8aaa\\u660e2\", \"location\": \"\\u6a4b\\u6a11\"}]
{'pic': 'QmdYSopPxh46rQ5MjyMK5uw2sBKYVwjUNVoyKFYHb1cR97', 'note': '說明1', 'location': '水池'}
{'pic': 'QmdNGrc1S9paXycnH7ogdB8w7qDUcWnEFJMPe1Wfb9fYyP', 'note': '說明2', 'location': '橋樑'}
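For comparison, a minimal sketch of parsing the cleaned string in a single call, without splitting on '}, {'. The sample below is a shortened, made-up stand-in for the original string, and the name records is only illustrative.

```python
import json
import re

# Hypothetical shortened sample, escaped the same way as the string above
a = '[{\\"note\\": \\"\\\\u8aaa\\\\u660e1\\"}, {\\"note\\": \\"\\\\u8aaa\\\\u660e2\\"}]'

b = re.sub(r'\\"', '"', a)        # \" -> "
b = re.sub(r'\\\\u', r'\\u', b)   # \\u -> \u

# A top-level array is valid JSON, so the whole string parses at once
records = json.loads(b)
print(records)
```

This avoids relying on the exact '}, {' separator, which could also appear inside a value.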
Yehor Dec 9, 2019 at 03:59

Personally, I would use a parser for the language the string was extracted from, but since you did not mention which one that is, I am falling back on the string-escape decoding in Python's codecs module to do the job. It should work in most cases, but it could break in edge cases where the two languages differ in the escape sequences they support.

import codecs
import json

s = '[{\\\"pic\\\": \\\"QmdYSopPxh46rQ5MjyMK5uw2sBKYVwjUNVoyKFYHb1cR97\\\", \\\"note\\\": \\\"\\\\u8aaa\\\\u660e1\\\", \\\"location\\\": \\\"\\\\u6c34\\\\u6c60\\\"}, {\\\"pic\\\": \\\"QmdNGrc1S9paXycnH7ogdB8w7qDUcWnEFJMPe1Wfb9fYyP\\\", \\\"note\\\": \\\"\\\\u8aaa\\\\u660e2\\\", \\\"location\\\": \\\"\\\\u6a4b\\\\u6a11\\\"}]'
unescaped = codecs.decode(s, 'unicode-escape')
obj = json.loads(unescaped)
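As an illustration of the "use the source language's parser" idea, here is a sketch under one explicit assumption: that the text is JSON that was escaped a second time as the body of a JSON string, with the outer quotes lost. If that holds, the json module can undo both layers. The sample is a shortened, made-up stand-in for the original string.

```python
import json

# Hypothetical shortened sample, escaped the same way as the string above
s = '[{\\"note\\": \\"\\\\u8aaa\\\\u660e1\\", \\"location\\": \\"\\\\u6c34\\\\u6c60\\"}]'

# Assumption: s is the body of a JSON string literal without its outer quotes.
# Restoring the quotes lets json.loads undo the outer layer of escaping.
inner = json.loads('"' + s + '"')

# The result is ordinary JSON text, which a second call turns into Python objects.
obj = json.loads(inner)
print(obj)
```

If the string came from some other language or serializer, this assumption may not hold and the first call could raise json.JSONDecodeError.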
nhahtdh Dec 9, 2019 at 04:27